In this technical report, we present our solution, dubbed MV-FCOS3D++, for the camera-only 3D detection track of the 2022 Waymo Open Dataset Challenge. Multi-view camera-only 3D detection with bird's-eye-view or 3D geometric representations can leverage stereo cues from overlapping regions between adjacent views and perform 3D detection directly without hand-crafted post-processing. However, it lacks direct semantic supervision for the 2D backbone, which can be complemented by pretraining a simple monocular detector. Our solution is a multi-view framework for 4D detection following this paradigm. It is built upon the simple monocular detector FCOS3D++, pretrained only with object annotations from Waymo, and converts multi-view features into a 3D grid space to detect 3D objects on it. A dual-path neck for single-frame understanding and temporal stereo matching is designed to incorporate multi-frame information. Our method finally achieves 49.75% mAPL with a single model and wins second place in the WOD challenge, without any LiDAR-based depth supervision during training. The code will be released at https://github.com/tai-wang/depth-from-motion.
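As a rough illustration of the 2D-to-3D conversion described above, the sketch below (PyTorch) projects each voxel center of a 3D grid into every camera view and bilinearly samples the multi-view image features. The function name, argument shapes, and averaging scheme are assumptions for illustration, not the released MV-FCOS3D++ code.

```python
import torch
import torch.nn.functional as F

def lift_to_voxel_grid(feats, intrinsics, extrinsics, voxel_centers):
    """Project each 3D voxel center into every camera and bilinearly sample
    the per-view 2D features, then average over the views that see it.
    Names and shapes here are assumptions, not the authors' implementation.

    feats:         (V, C, H, W) per-view feature maps
    intrinsics:    (V, 3, 3) camera intrinsic matrices
    extrinsics:    (V, 4, 4) world-to-camera transforms
    voxel_centers: (N, 3) voxel center coordinates in the world frame
    """
    V, C, H, W = feats.shape
    ones = torch.ones_like(voxel_centers[:, :1])
    pts_h = torch.cat([voxel_centers, ones], dim=1)                 # (N, 4) homogeneous points
    fused, valid = [], []
    for v in range(V):
        cam = (extrinsics[v] @ pts_h.T)[:3]                         # (3, N) camera-frame points
        uvz = intrinsics[v] @ cam                                   # (3, N) projected pixels
        z = uvz[2].clamp(min=1e-5)
        u = uvz[0] / z / (W - 1) * 2 - 1                            # normalize x to [-1, 1]
        w = uvz[1] / z / (H - 1) * 2 - 1                            # normalize y to [-1, 1]
        grid = torch.stack([u, w], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(feats[v:v + 1], grid, align_corners=True)  # (1, C, 1, N)
        mask = (cam[2] > 0).float()                                 # keep voxels in front of the camera
        fused.append(sampled.view(C, -1) * mask)
        valid.append(mask)
    valid = torch.stack(valid).sum(0).clamp(min=1)
    return torch.stack(fused).sum(0) / valid                        # (C, N) fused 3D grid features
```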
Since data scarcity and data heterogeneity are prevalent for medical images, convolutional neural networks (CNNs) trained with previous normalization methods may perform poorly when deployed to a new site. However, a reliable model for real-world applications should generalize well on both in-distribution (IND) and out-of-distribution (OOD) data (e.g., data from a new site). In this study, we present a novel normalization technique called window normalization (WIN), a simple yet effective alternative to existing normalization methods. Specifically, WIN perturbs the normalizing statistics with the local statistics computed within a window of the features. This feature-level augmentation technique regularizes the models well and significantly improves their OOD generalization. Taking advantage of WIN, we propose a novel self-distillation method called WIN-WIN to further improve OOD generalization in classification. WIN-WIN is easily implemented with two forward passes and a consistency constraint, and serves as a simple extension of existing methods. Extensive experimental results on various tasks (e.g., glaucoma detection, breast cancer detection, chromosome classification, optic disc and cup segmentation) and 26 datasets demonstrate the generality and effectiveness of our methods. The code is available at https://github.com/joe1chief/windownormalizaion.
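Based only on the description above, a minimal sketch of what window normalization might look like is given below (PyTorch); the window size, the test-time behavior, and the class name `WindowNorm2d` are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class WindowNorm2d(nn.Module):
    """Normalize features with statistics computed over a randomly chosen
    spatial window during training; at inference this sketch falls back to
    plain instance statistics. Window ratio and fallback are assumptions.
    """
    def __init__(self, window_ratio=0.5, eps=1e-5):
        super().__init__()
        self.window_ratio = window_ratio
        self.eps = eps

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        if self.training:
            wh = max(1, int(H * self.window_ratio))
            ww = max(1, int(W * self.window_ratio))
            top = torch.randint(0, H - wh + 1, (1,)).item()
            left = torch.randint(0, W - ww + 1, (1,)).item()
            window = x[:, :, top:top + wh, left:left + ww]  # local feature window
        else:
            window = x
        mean = window.mean(dim=(2, 3), keepdim=True)
        var = window.var(dim=(2, 3), keepdim=True)
        return (x - mean) / torch.sqrt(var + self.eps)
```

Under the same reading, WIN-WIN would simply run two such randomized forward passes and add a consistency loss between their outputs, matching the "two forward passes and a consistency constraint" described above.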
Video frame interpolation (VFI) aims to generate predicted frames by warping learnable motions from bidirectional historical references. Most existing works utilize spatio-temporal semantic information extractors for motion estimation and interpolation modeling, without sufficiently considering the mechanical plausibility of the generated intermediate motions. In this paper, we reformulate VFI as a multi-variable non-linear (MNL) regression problem and propose a Joint Non-linear Motion Regression (JNMR) strategy to model the complicated inter-frame motions. To establish the MNL regression, ConvLSTM is adopted to construct the distribution of complete motions along the temporal dimension. The motion correlations between the target frame and multiple reference frames can then be regressed from the modeled distribution. Moreover, the feature learning network is designed to be optimized for the MNL regression modeling. A coarse-to-fine synthesis enhancement module is further employed to learn visual dynamics at different resolutions through repetitive regression and interpolation. Highly competitive experimental results on frame interpolation demonstrate the effectiveness of our method and significant improvements over state-of-the-art performance, and the robustness of complicated motion estimation is improved by the MNL motion regression.
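The abstract names ConvLSTM as the building block for modeling the temporal distribution of motions; for reference, a standard ConvLSTM cell looks roughly like the sketch below (PyTorch). This is the generic cell, not the JNMR architecture itself.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A standard ConvLSTM cell: the four LSTM gates are produced by a single
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g              # update the cell state
        h = o * torch.tanh(c)          # emit the new hidden state
        return h, c
```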
Sampling-based path planning algorithms usually implement uniform sampling methods to search the state space. However, uniform sampling may lead to unnecessary exploration in many scenarios, such as environments with a few dead ends. Our previous work proposed using promising regions to guide the sampling process to address this issue. However, the predicted promising regions are often disconnected, which means they cannot connect the start and goal states, resulting in a lack of probabilistic completeness. This work focuses on improving the connectivity of the predicted promising regions. Our proposed method regresses the connectivity probability of the edges in the x and y directions. In addition, it weights the promising edges in the loss to guide the neural network to pay more attention to the connectivity of the promising regions. We conduct a series of simulation experiments, and the results show that the connectivity of the promising regions is significantly improved. Furthermore, we analyze the influence of connectivity on sampling-based path planning algorithms and conclude that connectivity plays an essential role in maintaining algorithm performance.
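One plausible reading of the weighting scheme above is sketched below (PyTorch): edges that fall inside the predicted promising region receive a larger loss weight so that the network focuses on their connectivity. The function name, the binary-cross-entropy form, and the weight value are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def connectivity_loss(pred_x, pred_y, gt_x, gt_y, promising_mask, w_promising=5.0):
    """Weighted loss sketch for edge-connectivity regression in the x and y
    directions; promising edges receive a larger weight (assumed value)."""
    weight = 1.0 + (w_promising - 1.0) * promising_mask      # 1.0 elsewhere, w_promising on promising edges
    loss_x = F.binary_cross_entropy(pred_x, gt_x, weight=weight)
    loss_y = F.binary_cross_entropy(pred_y, gt_y, weight=weight)
    return loss_x + loss_y
```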
Recently, convolutional neural network (CNN) techniques have gained popularity as a tool for hyperspectral image classification (HSIC). To improve the feature extraction efficiency of HSIC under the condition of limited samples, current methods generally use deep models with many layers. However, deep network models are prone to overfitting and gradient vanishing when samples are limited. In addition, the spatial resolution decreases severely with depth, which is very harmful to spatial edge feature extraction. Therefore, this letter proposes a shallow model for HSIC, called the depthwise over-parameterized convolutional neural network (DOCNN). To ensure effective feature extraction with a shallow model, the depthwise over-parameterized convolution (DO-Conv) kernel is introduced to extract discriminative features. The DO-Conv kernel is composed of a standard convolution kernel and a depthwise convolution kernel, which can extract the spatial features of different channels individually and fuse the spatial features of all channels simultaneously. Moreover, to further reduce the loss of spatial edge features caused by the convolution operation, a dense residual connection (DRC) structure is proposed and applied to the feature extraction part of the whole network. Experimental results on three benchmark datasets show that the proposed method outperforms other state-of-the-art methods in terms of classification accuracy and computational efficiency.
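The composition of a depthwise kernel with a standard kernel can be sketched as below (PyTorch); the kernel shapes and the class name `DOConv2d` follow the general DO-Conv idea and are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOConv2d(nn.Module):
    """Minimal sketch of a depthwise over-parameterized convolution: a
    depthwise kernel D and a conventional kernel W are composed into one
    kernel before a standard convolution, so inference cost matches Conv2d."""
    def __init__(self, c_in, c_out, k=3, d_mul=None, stride=1, padding=1):
        super().__init__()
        self.k, self.stride, self.padding = k, stride, padding
        d_mul = d_mul or k * k                                  # depth multiplier (assumed default)
        self.D = nn.Parameter(torch.randn(c_in, d_mul, k * k) * 0.01)   # depthwise kernel
        self.W = nn.Parameter(torch.randn(c_out, c_in, d_mul) * 0.01)   # conventional kernel

    def forward(self, x):
        # Compose the two kernels into W': (c_out, c_in, k*k)
        W_prime = torch.einsum('cdk,ocd->ock', self.D, self.W)
        W_prime = W_prime.reshape(-1, x.shape[1], self.k, self.k)
        return F.conv2d(x, W_prime, stride=self.stride, padding=self.padding)
```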
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in the field of deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from further improvement. 1) The quality of positive samples heavily depends on carefully designed data augmentations, while inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls together samples from the same cluster and pushes away those from other clusters by maximizing and minimizing the cross-view cosine similarity between positive and negative samples, respectively. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
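A rough sketch of the cluster-guided objective described above might look as follows (PyTorch); the function signature and the exact way positives and negatives are averaged are assumptions based only on the abstract.

```python
import torch
import torch.nn.functional as F

def cluster_guided_contrastive_loss(z1, z2, labels, confident_mask, centers):
    """Pull together cross-view embeddings from the same high-confidence
    cluster and push each sample away from the centers of the other clusters.

    z1, z2:         (N, D) embeddings of the two graph views
    labels:         (N,) cluster assignments from the clustering result
    confident_mask: (N,) bool mask of high-confidence samples
    centers:        (K, D) high-confidence cluster centers
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    centers = F.normalize(centers, dim=1)
    z1c, z2c, lab = z1[confident_mask], z2[confident_mask], labels[confident_mask]
    # positives: cross-view cosine similarity within the same high-confidence cluster
    same_cluster = (lab.unsqueeze(0) == lab.unsqueeze(1)).float()    # (n, n)
    sim = z1c @ z2c.T
    pos = (sim * same_cluster).sum(1) / same_cluster.sum(1).clamp(min=1)
    # negatives: cosine similarity to the centers of the other clusters
    sim_c = z1c @ centers.T                                          # (n, K)
    neg_mask = torch.ones_like(sim_c).scatter_(1, lab.unsqueeze(1), 0.0)
    neg = (sim_c * neg_mask).sum(1) / neg_mask.sum(1).clamp(min=1)
    return (neg - pos).mean()        # maximize positive, minimize negative similarity
```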
To generate high-quality rendering images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that the rendered pixels at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, significantly improving their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that our proposed method generates higher-quality supersampling results than current state-of-the-art methods without increasing the total number of ray-tracing samples.
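As one illustrative piece of the temporal accumulation step, the sketch below (PyTorch) backward-warps the previous frame or feature map with per-pixel motion vectors before current and previous features are compared; this is a generic warping helper under assumed conventions, not the paper's network.

```python
import torch
import torch.nn.functional as F

def warp_previous(prev, motion):
    """Backward-warp the previous frame/features with per-pixel motion vectors
    (in pixels, pointing from the current frame to the previous one).

    prev:   (B, C, H, W) previous frame or feature map
    motion: (B, 2, H, W) motion vectors (x, y)
    """
    B, _, H, W = prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).to(prev)   # (1, 2, H, W) pixel grid
    coords = base + motion
    coords[:, 0] = coords[:, 0] / (W - 1) * 2 - 1                       # normalize to [-1, 1]
    coords[:, 1] = coords[:, 1] / (H - 1) * 2 - 1
    grid = coords.permute(0, 2, 3, 1)                                   # (B, H, W, 2) for grid_sample
    return F.grid_sample(prev, grid, align_corners=True)
```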
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary bias: the annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as new boundaries. 2) Reasoning bias: such incorrect new boundary frames also introduce bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on the boundaries for more accurate frame-query reasoning. This mechanism is also able to supplement the absent consecutive visual semantics of the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
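One simple way to realize "soft labels on boundaries" is sketched below (PyTorch): hard start/end frame indices are smoothed into small Gaussian distributions over frames. The Gaussian form and `sigma` are assumptions; the paper's exact formulation may differ.

```python
import torch

def soft_boundary_labels(num_frames, start_idx, end_idx, sigma=1.0):
    """Turn hard start/end indices into soft per-frame boundary labels."""
    t = torch.arange(num_frames).float()
    start = torch.exp(-(t - start_idx) ** 2 / (2 * sigma ** 2))   # soft start-boundary label
    end = torch.exp(-(t - end_idx) ** 2 / (2 * sigma ** 2))       # soft end-boundary label
    return start / start.sum(), end / end.sum()
```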
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames, without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occluded areas. Our approach, $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields, offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, and is therefore called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and on our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
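A rough sketch of how a per-sample occlusion weight could blend the static and dynamic components during volume rendering is given below (PyTorch); the blending formula is an assumption for illustration, not the released $\text{D}^4$NeRF code.

```python
import torch

def composite_ray(sigma_s, rgb_s, sigma_d, rgb_d, occ_w, deltas):
    """Blend a static background field and a dynamic field along one ray,
    then apply standard volume rendering.

    sigma_s, sigma_d: (S,) per-sample densities of the static / dynamic fields
    rgb_s,  rgb_d:    (S, 3) per-sample colors of the static / dynamic fields
    occ_w:            (S,) occlusion weights in [0, 1] (share of the dynamic part)
    deltas:           (S,) distances between adjacent samples
    """
    sigma = occ_w * sigma_d + (1 - occ_w) * sigma_s                  # blended density
    rgb = occ_w.unsqueeze(-1) * rgb_d + (1 - occ_w).unsqueeze(-1) * rgb_s
    alpha = 1 - torch.exp(-sigma * deltas)                           # standard volume rendering
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)                  # rendered pixel color
```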
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We make the opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, and apply it to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer leads to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
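As an example of the kind of explainability-oriented constraint mentioned above, the sketch below (PyTorch) penalizes the entropy of the learned attention maps to encourage sparsity; the exact regularizers used by NeuroExplainer are not specified in the abstract, so this is purely illustrative.

```python
import torch

def attention_sparsity_loss(att):
    """Entropy penalty on attention distributions: lower entropy means
    sparser, more focused attention.
    att: (B, N) attention weights over N cortical vertices, assumed to sum to 1 per row.
    """
    return -(att * torch.log(att + 1e-8)).sum(dim=1).mean()
```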